Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

Abstract

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature the basic text processing steps, including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition, with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with existing tools, and demonstrated the competitive performance of the new models in the text preprocessing steps. All experiments are reproducible, and the pipelines are freely available under a permissive license.

Similar resources

Multilingual, Efficient and Easy NLP Processing with IXA Pipeline

IXA pipeline is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology. It aims at lowering the barriers of using NLP technology both for research purposes and for small industrial developers and SMEs by offering robust and efficient linguistic annotation to both researchers and non-NLP experts. IXA pipeline can be used “as is” or exploit its m...

Automatic Evaluation and Composition of NLP Pipelines with Web Services

We describe the innovative use of describing an existing natural language “pipeline” using the Semantic Web, and focus on how the performance and results of the components may be described. Earlier work has shown how NLP Web Services can be automatically composed via Semantic Web Service composition, and once the results of NLP components can be stored directly, they can also be used to direct ...

Building reliable and efficient data transfer and processing pipelines

Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Ou...

Joining Statistics with NLP for Text Categorization

Automatic news categorization systems have produced high accuracy, consistency, and flexibility using some natural language processing techniques. These knowledge-based categorization methods are more powerful and accurate than statistical techniques. However, the phrasal pre-processing and pattern matching methods that seem to work for categorization have the disadvantage of requiring a fair a...

Improving Biomedical Text Categorisation with NLP

Background: Text categorisation has been used in bioinformatics to help identify documents containing protein-protein interactions. Standard text categorisation methods have used the bag-of-words approach with little input from NLP. While this has proved effective in the past, there is some evidence that the techniques are not adequate in some biological domains. Here we examine how chunking, n...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2023

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-40498-6_6